Drawing Representative Samples from Large Databases
Abstract
Sampling has been used in areas such as selectivity estimation (Hou & Ozsoyoglu, 1991; Haas & Swami, 1992; Jermaine, 2003; Lipton, Naughton, & Schneider, 1990; Wu, Agrawal, & Abbadi, 2001), OLAP (Acharya, Gibbons, & Poosala, 2000), clustering (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998; Palmer & Faloutsos, 2000), and spatial data mining (Xu, Ester, Kriegel, & Sander, 1998). Because of its importance, sampling has been incorporated into modern database systems.
Uniform random sampling has been used in various applications. However, it has also been criticized for treating objects uniformly even when they have non-uniform probability distributions. Consider a Gallup poll for a federal election as an example. The sample is constructed by randomly selecting residential telephone numbers. Unfortunately, the sample selected is not truly representative of the actual voters in the election. A major reason is that, according to statistics, most voters between the ages of 18 and 24 do not cast their ballots, while most senior citizens go to the polling booths on Election Day. Since Gallup's sample does not take this into account, the survey could deviate substantially from the actual election results.
Finding representative samples is also important for many data mining tasks. For example, a carmaker may want to add desirable features to its new luxury car model. Since not all people are equally likely to buy the car, the most attractive features can be revealed only from a representative sample of potential luxury car buyers. As another example, consider deriving association rules from market basket data, where the goal is to place items that are often purchased together in nearby locations. While serving ordinary customers, the store may want to pay special attention to customers who are handicapped, pregnant, elderly, and so on. Uniform sampling may not include enough of these under-represented customers. However, by giving higher inclusion probabilities to (the transaction records of) these customers in sampling, the special care can be reflected in the association rules.
To find representative samples for populations with non-uniform probability distributions, some remedies have been proposed, such as density-biased sampling (Palmer & Faloutsos, 2000) and Acceptance/Rejection (AR) sampling (Olken, 1993). Density-biased sampling is specifically designed for applications where the probability of a group of objects is inversely proportional to its size. AR sampling, based on the "acceptance/rejection" approach (Rubinstein, 1981), is intended for all probability distributions and is probably the most general approach discussed in the database literature.
We are interested in finding a general, efficient, and accurate sampling method applicable to all probability distributions. In this research, we develop a Metropolis sampling method, based on the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953), to draw representative samples. As will become clear, the sample generated by this method is bona fide representative.
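To make the idea of non-uniform inclusion probabilities concrete, the following is a minimal Python sketch of a Metropolis-style sampler over a list of records. It is an illustration under stated assumptions, not the method developed in the paper: the function name `metropolis_sample`, the `weight` callback, the uniform proposal, the burn-in length, and the toy transaction data are all hypothetical choices made here.

```python
import random


def metropolis_sample(records, weight, n_samples, burn_in=1000, seed=0):
    """Draw records with probability roughly proportional to weight(record).

    Assumption: the proposal picks any record uniformly at random, which is
    symmetric, so the Metropolis acceptance probability reduces to
    min(1, weight(candidate) / weight(current)).
    """
    rng = random.Random(seed)
    # Start the chain at any record that has a positive weight.
    current = next((r for r in records if weight(r) > 0), None)
    if current is None:
        raise ValueError("at least one record needs a positive weight")

    sample = []
    for step in range(burn_in + n_samples):
        candidate = rng.choice(records)
        # Accept with probability min(1, w(candidate) / w(current)),
        # written without division to avoid dividing by small weights.
        if rng.random() * weight(current) < weight(candidate):
            current = candidate
        # Keep the current state once the burn-in period is over.
        if step >= burn_in:
            sample.append(current)
    return sample


# Hypothetical market-basket data: 90 ordinary and 10 priority customers.
records = [(i, "ordinary") for i in range(90)] + [(i, "priority") for i in range(90, 100)]


def toy_weight(record):
    # Assumed weighting: priority customers are five times as likely to be included.
    return 5.0 if record[1] == "priority" else 1.0


sample = metropolis_sample(records, toy_weight, n_samples=20000)
share = sum(1 for r in sample if r[1] == "priority") / len(sample)
print(f"priority share in sample: {share:.3f}")
```

With these toy weights the target share of priority records is 50/140, about 0.36, compared with 0.10 under uniform sampling. Consecutive draws from the chain are correlated and may repeat a record, so a practical use would typically thin the chain or deduplicate the result.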
Similar resources
Smart Query Definition for Content-Based Search in Large Sets of Graphs
Graphs are used in various application areas such as chemical, social or shareholder network analysis. Finding relevant graphs in large graph databases is thereby an important problem. Such search starts with the definition of the query object. Defining the query graph quickly and effectively so that it matches meaningful data in the database is difficult. In this paper, we introduce a system, ...
A fast layout algorithm for protein interaction networks
MOTIVATION: Graph drawing algorithms are often used for visualizing relational information, but a naive implementation of a graph drawing algorithm encounters real difficulties when drawing large-scale graphs such as protein interaction networks. RESULTS: We have developed a new, extremely fast layout algorithm for visualizing large-scale protein interaction networks in the three-dimensional sp...
Towards content-based retrieval of technical drawings through high-dimensional indexing
This paper presents a new approach to classify, index and retrieve technical drawings by content. Our work uses spatial relationships, visual elements and high-dimensional indexing mechanisms to retrieve complex drawings from CAD databases. This contrasts with conventional approaches which use mostly textual metadata for the same purpose. Creative designers and draftspeople often re-use data fr...
Clustering of highly homologous sequences to reduce the size of large protein databases
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive datab...
Sampling Strategies for Bag-of-Features Image Classification
Bag-of-features representations have recently become popular for content based image classification owing to their simplicity and good performance. They evolved from texton methods in texture analysis. The basic idea is to treat images as loose collections of independent patches, sampling a representative set of patches from the image, evaluating a visual descriptor vector for each patch indepe...
Publication date: 2015